Sequential Pattern Discovery under a Markov Assumption

Authors

  • Darya Chudova
  • Padhraic Smyth

Abstract

In this paper we investigate the general problem of discovering recurrent patterns that are embedded in categorical sequences. An important real-world problem of this nature is motif discovery in DNA sequences. There are a number of fundamental aspects of this data mining problem that can make discovery “easy” or “hard”. We characterize the difficulty of learning in this context using an analysis based on the Bayes error rate under a Markov assumption. The Bayes error framework demonstrates why certain patterns are much harder to discover than others. It also explains the role of different parameters such as pattern length and pattern frequency in sequential discovery. We demonstrate how the Bayes error can be used to calibrate existing discovery algorithms, providing a lower bound on the achievable error rate. We discuss a number of fundamental issues that characterize sequential pattern discovery in this context, present a variety of empirical results to complement and verify the theoretical analysis, and apply our methodology to real-world motif-discovery problems in computational biology.
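
As a rough illustration of the Bayes error idea mentioned in the abstract, the sketch below computes the Bayes error for labelling a single symbol as "pattern" or "background". It is a deliberately simplified, memoryless version and not the authors' analysis: it ignores the Markov context the paper relies on. The function name `bayes_error`, the 4-letter alphabet, the uniform background, and the 0.9 consensus probability are all assumptions chosen for illustration.

```python
from typing import Sequence


def bayes_error(prior_pattern: float,
                p_given_pattern: Sequence[float],
                p_given_background: Sequence[float]) -> float:
    """Bayes error for labelling one symbol as pattern vs. background.

    For each symbol, the optimal (maximum a posteriori) rule picks the class
    with the larger joint probability; the error is the total joint mass of
    the class that was not picked.
    """
    prior_background = 1.0 - prior_pattern
    return sum(
        min(prior_pattern * pp, prior_background * pb)
        for pp, pb in zip(p_given_pattern, p_given_background)
    )


# Hypothetical numbers (not from the paper): a 4-letter alphabet with a
# uniform background, and a pattern position that emits its consensus
# symbol with probability 0.9.
background = [0.25, 0.25, 0.25, 0.25]
pattern = [0.9, 0.1 / 3, 0.1 / 3, 0.1 / 3]

if __name__ == "__main__":
    for prior in (0.5, 0.1, 0.01):
        print(prior, bayes_error(prior, pattern, background))
```

Under these assumed numbers, once the pattern prior drops to 0.1 the optimal single-symbol rule labels every symbol as background, so the error equals the prior. This is one simple way to see why pattern frequency matters, and why context across neighbouring positions is needed to do better.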

Similar resources

Generative Modeling of Itemset Sequences Derived from Real Databases

The problem of discovering temporal and attribute dependencies from multi-sets of events derived from real-world databases can be mapped as a sequential pattern mining task. Although generative approaches can offer a critical compact and probabilistic view of sequential patterns, existing contributions are only prepared to deal with sequences with a fixed multivariate order. Thus, this work targ...

A novel grey–fuzzy–Markov and pattern recognition model for industrial accident forecasting

Industrial forecasting is a top-echelon research domain, which has over the past several years experienced highly provocative research discussions. The scope of this research domain continues to expand due to the continuous knowledge ignition motivated by scholars in the area. So, more intelligent and intellectual contributions on current research issues in the accident domain will potentially ...

Does Fundraising Have Meaningful Sequential Patterns? The Case of Fintech Startups

Nowadays, fundraising is one of the most important issues for both Fintech investors and startups. The pattern of fundraising in terms of “number and type of rounds and stages needed” is important. The diverse features and factors that could stem from Fintech business models and can influence success are among the key issues in shaping these patterns. This study applied the top 100 KPMG Fintech...

Data Mining for Web Personalization

In this chapter we present an overview of the Web personalization process viewed as an application of data mining requiring support for all the phases of a typical data mining cycle. These phases include data collection and preprocessing, pattern discovery and evaluation, and finally applying the discovered knowledge in real-time to mediate between the user and the Web. This view of the personaliza...

A reservoir-driven non-stationary hidden Markov model

In this work, we propose a novel approach towards sequential data modeling that leverages the strengths of hidden Markov models and echo-state networks (ESNs) in the context of nonparametric Bayesian inference approaches. We introduce a non-stationary hidden Markov model, the time-dependent state transition probabilities of which are driven by a high-dimensional signal that encodes the whole hi...

Publication date: 2002